Abstract Histograms are used to summarize the contents of relations into
a number of buckets for the estimation of query result sizes. Several
techniques (e.g., MaxDiff and V-Optimal) have been proposed in the past for
determining bucket boundaries which provide accurate estimations. However,
while search strategies for optimal bucket boundaries are rather
sophisticated, not much attention has been paid to estimating queries inside
buckets, and all of the above techniques adopt naive methods for such an
estimation. This paper focuses on the problem of improving the estimation
inside a bucket once its boundaries have been fixed. The proposed technique
is based on the addition, to each bucket, of 32 bits of additional
information (organized into a 4-level tree index), storing approximate
cumulative frequencies at 7 internal intervals of the bucket. Both
theoretical analysis and experimental results show that, among a number of
alternative ways to organize the additional information, the 4-level tree
index (4LT) provides the best frequency estimation inside a bucket. The index
is then added to two well-known histograms, MaxDiff and V-Optimal, obtaining
the non-obvious result that, despite the spatial cost of 4LT, which reduces
the number of allowed buckets once the storage space has been fixed, the
original methods are strongly improved in terms of accuracy.

Key words histograms, range query estimation, approximate OLAP

1 Introduction

A histogram is a lossy compression technique used to represent a relation
efficiently. It is based on partitioning one of the relation attributes into
buckets and storing, for each of them, a small amount of summary information
in place of the detailed data. Important application domains of histograms
include the estimation of query selectivity 1121 ^,18,13,22 , temporal
databases, where histograms are used to improve join processing [20], and
statistical databases, where histograms are a method for approximating
probability distributions [15]. Recently, histograms have received renewed
interest, mainly because they can be used effectively for approximate query
answering, in order to reduce the query response time in on-line decision
support systems and OLAP [17], as well as for the problem of reconstructing
original data from aggregate information 12] and, finally, in the context of
data streams |MTl[7lllO| .

For a given storage space reduction, the problem of determining the 
best histogram is crucial. Indeed, different partitions lead to dramatically 
different errors in reconstructing the original data distribution, especially 
for skewed data. To better explain the problem, consider a typical case of 
recovering original data from a histogram: the evaluation of range queries. 
Think of a histogram defined on the attribute X of a relation R as a set of
non-overlapping intervals of X covering all values assumed by X in R. To
each of these intervals, say B, the number of occurrences (called frequency)
in R having the value of X belonging to the interval B is associated (and
included into a data structure called bucket). A range query, defined on an
interval Q of X, evaluates the number of occurrences in R with value of
X in Q. Thus, buckets embed a set of pre-computed disjoint range queries
capable of covering the whole active domain of X in R (by active we mean
here the attribute values actually appearing in R). As a consequence, the
histogram does not give, in general, the possibility of evaluating exactly 
a range query not corresponding to one of the pre-computed embedded 
queries. In other words, while the contribution to the answer coming from 
the sub-ranges coinciding with entire buckets can be returned exactly, the 
contribution coming from the sub-ranges which partially overlap buckets 
can be only estimated, since the actual data distribution inside the buckets 
is not available. 

It turns out that it is convenient to define the boundaries of buckets in 
such a way that the estimation error of the non-precomputed range queries 
is minimized (e.g., by avoiding that large frequency differences arise inside 
a bucket). In other words, among all possible sets of pre-computed range 
queries, we find the set which guarantees the best estimation of the other 
(non-precomputed) queries, once a technique for estimating such queries is 
defined. This issue has been investigated for some decades, and a large
number of techniques for arranging histograms have been proposed ||51IHlll2[
All these techniques adopt simple methods for estimating non-precomputed
queries (actually, their portions partially overlapping buckets). The most
significant approaches are the continuous value assumption (often denoted
in this paper by CVA) [19], where the estimation is made by linear
interpolation on the whole domain of the bucket, and the uniform spread
assumption (denoted by USA) [18], which assumes that values are located
at equal distance from each other so that the overall frequency sum can be
equally distributed among them.

An interesting problem is understanding whether, by exploiting information
typically contained in histogram buckets, and possibly adding a small amount
of summary information, the frequency estimation inside buckets, and hence
the histogram accuracy, can be improved. This paper focuses on this problem.
Starting from the limits of CVA and USA studied in P], we propose to use
some additional storage space in order to describe the distribution inside
a bucket in an approximate yet very effective way.

The first step is studying how to use these 32 additional bits in order to 
maximize the benefits in terms of accuracy. Our analysis shows that the
trivial technique of partitioning the bucket into 8 equal-size parts and
encoding each corresponding sum with 4 bits leads to high scaling errors,
since each sum must be represented as a fraction of the overall sum of the
bucket. Our proposal then relies on the idea of storing partial sums internal
to the bucket in a hierarchical fashion, using a tree-like index (occupying
32 bits).
This way, the sum contained in a given tree node, can be represented as 
a fraction of the sum contained in the parent node, which is a value (rea- 
sonably) smaller than the overall sum of the bucket. It turns out that the 
encoding length may decrease as the level of the tree increases. The benefits 
we expect from this approach concern the scaling error. A crucial point,
however, is deciding how to arrange the tree, that is, how deep the index
should go. Of course, the higher the resolution, the larger the number of
embedded precomputed range queries (internal to the buckets).
Hence, we expect better accuracy as the resolution increases. However, in- 
creasing resolution reduces the number of bits available for encoding nodes, 
and, thus, amplifies scaling errors. We study the above trade-off by con- 
sidering the two possible (from a practical point of view) tree-indices with 
32 bits, which we call 3LT and 4LT, with depth 3 and 4, respectively. The 
analysis leads to the conclusion that the 4LT-index represents the best so- 
lution. 

The next step is then understanding whether this improvement of accu- 
racy for the estimation inside buckets can really give benefits in terms of 
accuracy of a histogram arranged by one of the existing techniques. This
problem is not straightforward: to mention only the most evident aspect,
4LT buckets use 32 bits more than CVA ones and therefore, for a fixed
storage space, allow a smaller number of buckets. The last part of this
paper is thus devoted to evaluating the effects of combining the 4LT
technique with existing methods for building histograms. Through an
extensive experimental comparison conducted, for a fixed storage space, over
several data sets, both synthetic and real-life, we show that 4LT
significantly improves the accuracy of the considered histograms. Therefore
this paper, besides giving the specific contribution of proposing a technique
(i.e., the 4LT) for accurately estimating range queries internal to buckets,
shows the more general result that going beyond classical techniques (i.e.,
CVA and USA) for the estimation inside buckets may give concrete improvements
of histogram accuracy.

It is worth noting that the choice of MaxDiff and V-Optimal histograms
for testing our method does not limit the generality of the 4LT index, which
is applicable to every bucket-based histogram^1. Nor does it limit the
validity of our comparison, since MaxDiff and V-Optimal, despite their age,
are still considered reference points in this scientific community due to
their accuracy.

The paper is organized as follows. In Section 2 we introduce some
preliminary definitions. The comparison, both experimental and theoretical,
among a number of techniques for estimating range queries inside a bucket,
including our tree-based methods (3LT and 4LT), is reported in Section 3,
where 3LT and 4LT are also presented. This analysis shows that 4LT has the
best performance in terms of accuracy. Thus, 4LT can be combined with every
bucket-based histogram to increase its accuracy. A later section presents a
large set of experiments, conducted by applying 4LT to two well-known
methods, MaxDiff and V-Optimal. Results show high improvements in the
estimation of range queries w.r.t. the original methods. Of course, the
comparisons are made at parity of storage consumption, so that the revised
methods use fewer buckets to compensate for the additional storage required
by the 4LT indices. The 4LT technique provides good results also when
combined with the very simple method EquiSplit, which consists in dividing
the histogram value domain into buckets of the same size so that the bucket
boundaries need not be stored, thus obtaining a very high number of buckets
at the same compression rate. We draw our conclusions in the final section.

2 Basic Definitions 

Given a relation $R$ and an attribute $X$ of $R$, a histogram for $R$ on $X$
is constructed as follows. Let $U = \{u_1, \ldots, u_m\}$ be the set of all
possible values (the domain) of $X$ and let $u_i < u_{i+1}$ for each $i$,
$1 \le i < m$. The frequency set for $X$ is the set
$\mathcal{F} = \{f(u_1), \ldots, f(u_m)\}$ such that for each $i$,
$1 \le i \le m$, $f(u_i)$ is the number of occurrences of the attribute value
$u_i$ in the relation $R$. The cumulative frequency set
$S = \{s_1, \ldots, s_m\}$ contains the value $s_i = \sum_{j=1}^{i} f(u_j)$
for each attribute value $u_i$. The value set
$V = \{u_i \in U \mid f(u_i) > 0\}$ is the active domain of $X$ in $R$, as it
consists of all attribute values actually occurring in the relation $R$
(non-null values). Given any $u_i \in V$, the spread $d_i$ of $u_i$ is defined
as 1 if $u_i$ is the last non-null value, or otherwise as the difference
$u_j - u_i$, where $u_j$ is the first non-null value for which $u_j > u_i$
(i.e., $d_i$ is the distance from $u_i$ to the next non-null value).

^1 There are histograms, like wavelet-based ones, that are not based on a set
of buckets.

A bucket $B$ for $R$ on $X$ is a 4-tuple $\langle inf, sup, t, c \rangle$,
where $u_{inf}$ and $u_{sup}$, $1 \le inf \le sup \le m$, are the boundaries
of the domain range pertaining to the bucket, $t$ is the number of non-null
values occurring in the range, and $c = \sum_{i=inf}^{sup} f(u_i)$ is the sum
of frequencies of all values in the range. We say that the bucket $B$ is
1-biased if $u_{sup}$ is not null; if $u_{inf}$ is also not null, then we say
that $B$ is 2-biased.

A histogram $H$ for $R$ on $X$ is an $h$-tuple
$\langle B_1, B_2, \ldots, B_h \rangle$ of buckets such that: (1) for each
$1 \le i < h$, the upper bound of $B_i$ precedes the lower bound of $B_{i+1}$
and (2) $u \in V$ implies $u \in B_i$ for some $i$, $1 \le i \le h$. Condition
(1) guarantees that buckets do not overlap each other, and condition (2)
enforces that every non-null value be hosted by some bucket. Classically,
histograms have 2-biased buckets; sometimes, for storage optimization,
2-biased buckets are made 1-biased by replacing the lower bound of each
bucket with the successor, in the domain, of the upper bound of the preceding
bucket.

A classical problem on histograms is: given a histogram $H$ and a (range)
query of the form $u_j \le X \le u_i$, $1 \le j \le i \le m$, estimate the
overall frequency $\sum_{k=j}^{i} f(u_k)$ in the range from $u_j$ to $u_i$.
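
The definitions of this section can be made concrete with a small sketch. It
is only illustrative: the function names and the toy domain below are
assumptions introduced here, not part of the paper, and positions are 1-based
as in the text.

```python
from itertools import accumulate


def cumulative(freqs):
    """Cumulative frequency set: s_i = f(u_1) + ... + f(u_i)."""
    return list(accumulate(freqs))


def active_domain(values, freqs):
    """Value set V: the attribute values with non-zero frequency."""
    return [u for u, f in zip(values, freqs) if f > 0]


def spreads(values, freqs):
    """Spread of each non-null value: distance to the next non-null value,
    and 1 for the last non-null value."""
    v = active_domain(values, freqs)
    return [nxt - cur for cur, nxt in zip(v, v[1:])] + ([1] if v else [])


def range_sum(freqs, j, i):
    """Exact answer to the range query u_j <= X <= u_i (1-based positions)."""
    s = [0] + cumulative(freqs)
    return s[i] - s[j - 1]


# Hypothetical toy domain u_1..u_8 = 1..8 with its frequency set.
values = [1, 2, 3, 4, 5, 6, 7, 8]
freqs = [3, 0, 0, 5, 2, 0, 0, 4]
```

A histogram replaces `freqs` by a few buckets, so `range_sum` can only be
answered exactly when the query boundaries coincide with bucket boundaries.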

3 Estimation Inside a Bucket 

In this section we investigate in depth the problem of frequency estimation
inside buckets. First of all, we present the two classical techniques (CVA and
USA), discuss their limitations and propose some simple alternatives. Then 
we introduce a novel technique which is based on a 4-level tree index storing 
approximate representations of the partial sums of 7 fixed bucket intervals. 
Later we evaluate the accuracy of the various techniques by performing both 
a theoretical analysis of errors and a number of experiments on some typical 
sample distributions. 

3.1 Notations and Problem Formulation 

Let $B = \langle inf, sup, t, c \rangle$ be a bucket on an attribute $X$ of a
relation $R$. Without loss of generality, we assume that $inf = 1$ and
$sup = b$, so that we can represent the frequency set inside the bucket as a
vector $F$ with indexes ranging from 1 to $b$ (frequency vector of $B$).
Similarly, the cumulative frequencies are represented by a vector $S$ with
indexes from 1 to $b$ (cumulative frequency vector of $B$). Hence, for each
$i$, $1 \le i \le b$, $F[i] \ge 0$ is the frequency of the value $u_i$, while
$S[i] = \sum_{j=1}^{i} F[j]$ is its cumulative frequency. Then $c = S[b]$ is
the sum of all frequencies in the bucket; moreover, for notational
convenience, we assume that $S[0] = 0$.

The problem of the estimation inside a bucket can be formulated as follows:
given any pair $i, j$, $1 \le i \le j \le b$, such that $d = j - i + 1 < b$,
estimate the range query $S[j] - S[i-1] = \sum_{k=i}^{j} F[k]$. We focus our
attention on the basic problem of estimating $S[d]$ (that is, we assume
$i = 1$).

We now introduce the following notation. Given $1 \le i \le j \le 8$, we
denote by $\delta_{i/j}$ the sum $\sum_{k=x}^{y} F[k]$, where
$x = 1 + \lceil \frac{b}{j} \cdot (i-1) \rceil$ and
$y = \lceil \frac{b}{j} \cdot i \rceil$. $\delta_{i/j}$ represents the
frequency sum of the $i$-th element of the partition of $B$ into $j$
equal-size sub-ranges. Thus, the frequency sum for a bucket is
$\delta_{1/1}$; the frequency sums for the two halves are $\delta_{1/2}$ and
$\delta_{2/2}$; the frequency sums for the 4 quarters are $\delta_{i/4}$,
$1 \le i \le 4$; the frequency sums for the 8 eighths are $\delta_{i/8}$,
$1 \le i \le 8$, and so on.
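
As a minimal sketch of this notation, the partial sum of the i-th of j
equal-size sub-ranges can be computed as follows. The helper names `ceil_div`
and `delta` are assumptions made for illustration, and the ceiling-based
boundaries follow the reading of the definition given above.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b, avoiding floating-point rounding."""
    return -(-a // b)


def delta(F, i, j):
    """Frequency sum of the i-th of j equal-size sub-ranges of the bucket.

    F is the frequency vector of the bucket (F[0] holds the paper's F[1]);
    positions x..y below are 1-based, as in the text.
    """
    b = len(F)
    x = 1 + ceil_div(b * (i - 1), j)
    y = ceil_div(b * i, j)
    return sum(F[x - 1:y])


# Hypothetical bucket of size b = 8.
F = [4, 0, 1, 3, 2, 2, 0, 8]
quarters = [delta(F, i, 4) for i in range(1, 5)]
```

For any j, the j sub-range sums partition the bucket, so they add up to the
bucket sum c, also when b is not a multiple of j.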

3.2 Estimation Techniques 

Next we illustrate the existing approximation techniques and discuss some 
additional simple approaches. 

Continuous Value Assumption (CVA). The estimation of $S[d]$ is computed as
$\tilde{S}[d] = \frac{d}{b} \cdot c$. In words, the partial contribution of a
bucket to a range query result is estimated by linear interpolation. As
pointed out in 012], the above estimation coincides with the expected value
of $S[d]$ when it is considered a random variable over the population of all
frequency distributions in the bucket for which the overall cumulative
frequency is $c$.

Uniform Spread Assumption (USA). The estimation of $S[d]$ is given by
$\tilde{S}[d] = \left(1 + \left\lfloor \frac{(d-1) \cdot (t-1)}{b-1} \right\rfloor\right) \cdot \frac{c}{t}$,
where $t$ is the number of non-null attribute values in the bucket. The
uniform spread assumption assumes that such values are distributed at equal
distance from each other and that the overall frequency sum is equally
distributed among them. Obviously, in this case the information $t$ is
necessary. We stress that, as discussed in [21, this estimation is not
supported by any unbiased probabilistic model, so the assumption is rather
arbitrary.
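
The two classical estimators can be sketched as below. This is an
illustration under stated assumptions rather than the authors' code: the
function names are hypothetical, and the USA estimate is written as the
number of equally spaced non-null values (spread (b-1)/(t-1), first value at
position 1) that fall within the first d positions, each carrying frequency
c/t.

```python
def cva(d, b, c):
    """Continuous value assumption: linear interpolation over the bucket."""
    return d / b * c


def usa(d, b, c, t):
    """Uniform spread assumption (requires b > 1).

    Assumes the t non-null values are equally spaced over the b positions,
    starting at position 1, each with frequency c / t.
    """
    covered = 1 + (d - 1) * (t - 1) // (b - 1)  # non-null values in 1..d
    return covered * c / t


# Hypothetical bucket: size 8, overall sum 20, 4 non-null values.
b, c, t = 8, 20, 4
estimates = [(d, cva(d, b, c), usa(d, b, c, t)) for d in range(1, b + 1)]
```

CVA grows linearly with d, while USA is a step function that jumps by c/t
each time the range reaches another assumed non-null position.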

1-Biased Estimation (1b). The possibly available information on the
number t of non-null elements cannot be exploited in the estimation unless 
some further information on the frequency distribution is either available or 
assumed (as for the USA estimation). We next show how to exploit the fact 
that a bucket is often 1-biased (i.e., ut is not null) using the probabilistic 
approach proposed in T . This approach assumes that the query is a random 
variable on the population of all 1-biased frequency distributions having c 
as overall cumulative frequency. The estimation of the range query S[d] for 
a 1-biased bucket is given by S[d] = ■ ■ c. 

2-Split Estimation (2s). We split the bucket into two parts of the same
size and store the cumulative frequency of the first part, say
$\delta_{1/2} = S[b/2]$; we therefore need additional storage space
(typically 32 bits). We call this method 2-split, or 2s for short. Following
this approach, the estimation of the range query $S[d]$ is given by
$\tilde{S}[d] = 2 \cdot \frac{d}{b} \cdot \delta_{1/2}$ if $d \le \frac{b}{2}$,
and by
$\tilde{S}[d] = \delta_{1/2} + 2 \cdot \frac{d - b/2}{b} \cdot (c - \delta_{1/2})$
otherwise. Thus we use the CVA technique for each of the two halves of
the bucket.

4-Split Estimation (4s). We split the bucket into 4 parts of the same
size (quarters) and store the approximate values of the cumulative frequency
of each part, $\delta_{i/4}$, $1 \le i \le 4$. In case the additional
available space is 32 bits, we use 8 bits for each approximate value, which
is therefore computed as
$\tilde{\delta}_{i/4} = \left\langle \frac{\delta_{i/4}}{c} \times (2^8 - 1) \right\rangle$,
where $\langle x \rangle$ stands for $round(x)$. The frequency sum for an
interval $d$ is estimated by adding the approximate values of all the leading
quarters that are fully contained in the interval plus the CVA estimation of
the portion of the last quarter that partially overlaps the interval.
Obviously, in order to reduce the approximation error, in case $d > b/2$ it
is convenient to derive the approximate value from the estimation of the
cumulative frequency in the complementary interval from $d+1$ to $b$.

8-Split Estimation (8s). It is analogous to the 4-Split Estimation. The
only difference is that the bucket is divided into 8 parts (eighths) and, for
each of them, we use 4 bits for storing the cumulative frequency. Thus, the
approximate value of the $i$-th eighth ($1 \le i \le 8$) is computed as
$\tilde{\delta}_{i/8} = \left\langle \frac{\delta_{i/8}}{c} \times (2^4 - 1) \right\rangle$,
where $\langle x \rangle$ stands for $round(x)$.
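
The flat k-split methods can be sketched with one pair of functions. This is
a simplified illustration, not the authors' implementation: the function
names are hypothetical, b is assumed to be a multiple of k and c positive,
the complementary-interval refinement for d > b/2 is omitted, and Python's
built-in `round` (which rounds exact ties to the nearest even integer) stands
in for the paper's round(x).

```python
def split_codes(F, k, bits):
    """Quantize each of the k sub-range sums as a fraction of the bucket sum."""
    b, c = len(F), sum(F)
    scale = (1 << bits) - 1          # largest value that fits in `bits` bits
    step = b // k                    # sub-range size (assumes b % k == 0)
    return [round(sum(F[i * step:(i + 1) * step]) / c * scale)
            for i in range(k)]


def split_estimate(codes, d, b, c, bits):
    """Estimate S[d]: full sub-ranges plus CVA inside the partial one."""
    k = len(codes)
    scale = (1 << bits) - 1
    step = b // k
    approx = [code / scale * c for code in codes]   # decoded sub-range sums
    full, rest = divmod(d, step)
    estimate = sum(approx[:full])
    if rest:
        estimate += rest / step * approx[full]
    return estimate


# Hypothetical bucket; 4s uses k = 4 with 8 bits, 8s uses k = 8 with 4 bits.
F = [4, 0, 1, 3, 2, 2, 0, 8]
codes_4s = split_codes(F, k=4, bits=8)
codes_8s = split_codes(F, k=8, bits=4)
```

Every code is scaled against the whole bucket sum c, which is why the
quantization step of 8s, c/15, is much coarser than that of 4s, c/255.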

3.3 The Tree Indices for Bucket Frequency Estimation 

We now propose to use 32 bits as sophisticated tree-indices for providing an 
approximate description of the cumulative frequencies in the bucket — this 
index can be easily extended also to the case that more bits are available. 
To this end, we store the approximate value of the cumulative frequency in 
a suitable number of intervals inside the bucket. The first type of tree-index 
is 3LT. 

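3 Level Tree index (3LT). The 3LT index uses 11 bits for approximating
the value of $\delta_{1/2}$, and 10 bits each for approximating
$\delta_{1/4}$ and $\delta_{3/4}$.

Let $L_{1/2}$ be the 11-bit string corresponding to $\delta_{1/2}$, and let
$L_{1/4}$ and $L_{3/4}$ be the 10-bit strings corresponding, respectively, to
$\delta_{1/4}$ and $\delta_{3/4}$. The three $L$ strings are constructed as
follows:

$L_{1/2} = \left\langle \frac{\delta_{1/2}}{\delta_{1/1}} \cdot (2^{11} - 1) \right\rangle$;
$L_{1/4} = \left\langle \frac{\delta_{1/4}}{\delta_{1/2}} \cdot (2^{10} - 1) \right\rangle$;
$L_{3/4} = \left\langle \frac{\delta_{3/4}}{\delta_{2/2}} \cdot (2^{10} - 1) \right\rangle$

where, we recall, $\langle x \rangle$ stands for $round(x)$.

The approximate values for the partial sums are given by:

$\tilde{\delta}_{1/1} = \delta_{1/1} = c$

$\tilde{\delta}_{1/2} = \frac{L_{1/2}}{2^{11} - 1} \cdot \tilde{\delta}_{1/1}$;
$\tilde{\delta}_{2/2} = \tilde{\delta}_{1/1} - \tilde{\delta}_{1/2}$

$\tilde{\delta}_{1/4} = \frac{L_{1/4}}{2^{10} - 1} \cdot \tilde{\delta}_{1/2}$;
$\tilde{\delta}_{2/4} = \tilde{\delta}_{1/2} - \tilde{\delta}_{1/4}$;
$\tilde{\delta}_{3/4} = \frac{L_{3/4}}{2^{10} - 1} \cdot \tilde{\delta}_{2/2}$;
$\tilde{\delta}_{4/4} = \tilde{\delta}_{2/2} - \tilde{\delta}_{3/4}$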
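
A minimal sketch of the 3LT encoding and decoding steps follows. The function
names are hypothetical, b is assumed to be a multiple of 4, a zero parent sum
is encoded as 0 (an assumption, since the text does not discuss that case),
and Python's `round` stands in for the paper's round(x).

```python
def quantize(part, whole, bits):
    """Code of `part` as a fraction of `whole`, on a scale of 2**bits - 1."""
    scale = (1 << bits) - 1
    return round(part / whole * scale) if whole else 0


def dequantize(code, whole, bits):
    """Approximate value represented by `code` relative to `whole`."""
    return code / ((1 << bits) - 1) * whole


def encode_3lt(F):
    """Return (L_1/2, L_1/4, L_3/4) for the frequency vector F."""
    b = len(F)
    half, quarter = b // 2, b // 4
    d12, d22 = sum(F[:half]), sum(F[half:])
    d14, d34 = sum(F[:quarter]), sum(F[half:half + quarter])
    return (quantize(d12, sum(F), 11),   # first half relative to the bucket
            quantize(d14, d12, 10),      # first quarter relative to first half
            quantize(d34, d22, 10))      # third quarter relative to second half


def decode_3lt(codes, c):
    """Approximate sums of the four quarters, given the codes and c."""
    l12, l14, l34 = codes
    a12 = dequantize(l12, c, 11)
    a22 = c - a12
    a14 = dequantize(l14, a12, 10)
    a34 = dequantize(l34, a22, 10)
    return [a14, a12 - a14, a34, a22 - a34]


# Hypothetical bucket of size 8 with c = 20; true quarter sums are 4, 4, 4, 8.
F = [4, 0, 1, 3, 2, 2, 0, 8]
codes = encode_3lt(F)
quarters = decode_3lt(codes, sum(F))
```

Only the left child of each node is stored; the right child is obtained by
subtraction from the parent, so the decoded quarters always add up to c.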



Observe that the 32-bit index refers to a 3-level tree whose nodes store,
directly or indirectly, the approximate values of the cumulative frequencies
for fixed intervals: the root stores the overall cumulative frequency $c$,
the two nodes of the second level store the cumulative frequencies for the
two halves of the bucket, and so on.

Fig. 1 The 3-level tree.

Example 1 Consider the 3-level tree in Figure 1. The 32 bits store the
following approximate cumulative frequencies: $L_{1/2} = 1320$ (on a scale
of 2047), $L_{1/4} = 518$ and $L_{3/4} = 935$ (both on a scale of 1023).

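We are now ready to solve the frequency estimation inside the bucket $B$.
Given $d$, $1 \le d < b$, let $i$ be the integer for which
$\lceil (i-1)/4 \cdot b \rceil < d \le \lceil i/4 \cdot b \rceil$.
Then the approximate value of $S[d]$ is:

$\tilde{S}[d] = P(i) + P'(i) + \frac{d - \lceil (i-1)/4 \cdot b \rceil}{\lceil i/4 \cdot b \rceil - \lceil (i-1)/4 \cdot b \rceil} \cdot \tilde{\delta}_{i/4}$

where

$P(i) = \tilde{\delta}_{1/2}$ if $i > 2$, and $P(i) = 0$ otherwise;

$P'(i) = \tilde{\delta}_{1/4}$ if $i = 2$, $P'(i) = \tilde{\delta}_{3/4}$ if
$i = 4$, and $P'(i) = 0$ otherwise.

Thus we use the interpolation based on the CVA only inside a segment of
length $\lceil (1/4) \cdot b \rceil$. This component becomes zero at each
distance $d = \lceil i \cdot b/4 \rceil$, $1 \le i \le 4$.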

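The 32 bits may be distributed in such a way that the granularity of the
tree-index increases w.r.t. 3LT. The 4LT index has 4 levels and uses 6 bits
for the first level, 5 bits for the second one and 4 bits for the last level.

4 Level Tree index (4LT). We reserve 4 bits to store the approximate
value of each of the following 4 partial sums: $\delta_{1/8}$,
$\delta_{3/8}$, $\delta_{5/8}$ and $\delta_{7/8}$; let $L_{i/8}$,
$i = 1, 3, 5, 7$, denote such 4-bit strings. We then use the remaining 16
bits as follows: the partial sums $\delta_{1/4}$ and $\delta_{3/4}$ are
approximated by the 5-bit strings $L_{1/4}$ and $L_{3/4}$, respectively,
while the partial sum $\delta_{1/2}$ is approximated by a 6-bit string
$L_{1/2}$. As a result, the larger the intervals, the higher the number of
bits used. The 7 $L$ strings are constructed as follows:

$L_{1/2} = \left\langle \frac{\delta_{1/2}}{\delta_{1/1}} \cdot (2^6 - 1) \right\rangle$

$L_{i/4} = \left\langle \frac{\delta_{i/4}}{\delta_{j/2}} \cdot (2^5 - 1) \right\rangle$
for $(i = 1 \wedge j = 1)$, $(i = 3 \wedge j = 2)$

$L_{i/8} = \left\langle \frac{\delta_{i/8}}{\delta_{j/4}} \cdot (2^4 - 1) \right\rangle$
for $(i = 1 \wedge j = 1)$, $(i = 3 \wedge j = 2)$, $(i = 5 \wedge j = 3)$,
$(i = 7 \wedge j = 4)$

where, we recall, $\langle x \rangle$ stands for $round(x)$.

The approximate values for the partial sums are eventually computed as:

$\tilde{\delta}_{1/1} = \delta_{1/1}$;
$\tilde{\delta}_{1/2} = \frac{L_{1/2}}{2^6 - 1} \times \tilde{\delta}_{1/1}$;
$\tilde{\delta}_{2/2} = \tilde{\delta}_{1/1} - \tilde{\delta}_{1/2}$

$\tilde{\delta}_{i/4} = \frac{L_{i/4}}{2^5 - 1} \times \tilde{\delta}_{j/2}$
for $(i = 1 \wedge j = 1)$, $(i = 3 \wedge j = 2)$

$\tilde{\delta}_{i/4} = \tilde{\delta}_{j/2} - \tilde{\delta}_{(i-1)/4}$
for $(i = 2 \wedge j = 1)$, $(i = 4 \wedge j = 2)$

$\tilde{\delta}_{i/8} = \frac{L_{i/8}}{2^4 - 1} \times \tilde{\delta}_{j/4}$
for $(i = 1 \wedge j = 1)$, $(i = 3 \wedge j = 2)$, $(i = 5 \wedge j = 3)$,
$(i = 7 \wedge j = 4)$

$\tilde{\delta}_{i/8} = \tilde{\delta}_{j/4} - \tilde{\delta}_{(i-1)/8}$
for $(i = 2 \wedge j = 1)$, $(i = 4 \wedge j = 2)$, $(i = 6 \wedge j = 3)$,
$(i = 8 \wedge j = 4)$

Fig. 2 The 4-level tree.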



Similarly to the 3LT-index, the 4LT-index refers to a 4-level tree whose 
nodes store directly or indirectly the approximate values of the cumulative 
frequencies for fixed hierarchical intervals starting from the root which stores 
the overall cumulative frequency c. 

Example 2 Consider the 4-level tree in Figure 2. The 32 bits store the
following approximate cumulative frequencies: $L_{1/2} = 33$,
$L_{1/4} = 18$, $L_{3/4} = 13$, $L_{1/8} = 6$, $L_{3/8} = 11$,
$L_{5/8} = 5$, $L_{7/8} = 7$.

Again, similarly to the 3LT-index, the frequency estimation inside the
bucket $B$ can be obtained by exploiting the content of the nodes of the
index. Given $d$, $1 \le d < b$, and the integer $i$ for which
$\lceil (i-1)/8 \times b \rceil < d \le \lceil i/8 \times b \rceil$,
the approximate value of $S[d]$ is:

$\tilde{S}[d] = P(i) + P'(i) + P''(i) + \frac{d - \lceil (i-1)/8 \times b \rceil}{\lceil i/8 \times b \rceil - \lceil (i-1)/8 \times b \rceil} \times \tilde{\delta}_{i/8}$

where

$P(i) = \tilde{\delta}_{1/2}$ if $i > 4$, and $P(i) = 0$ otherwise;

$P'(i) = \tilde{\delta}_{1/4}$ if $i = 3, 4$, $P'(i) = \tilde{\delta}_{3/4}$
if $i = 7, 8$, and $P'(i) = 0$ otherwise;

$P''(i) = \tilde{\delta}_{(i-1)/8}$ if $i$ is even, and $P''(i) = 0$
otherwise.

Thus we use interpolation as in CVA only inside a segment of length
$\lceil (1/8) \cdot b \rceil$. This component becomes zero at each distance
$d = \lceil i \times b/8 \rceil$, $1 \le i \le 8$. We call this estimation
4-level tree, or 4LT for short.
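
The whole 4LT pipeline (encoding, decoding and estimation) can be sketched as
follows. This is an illustration under assumptions, not the authors' code:
names are hypothetical, b is assumed to be a multiple of 8, a zero parent sum
is encoded as 0, Python's `round` stands in for the paper's round(x), and the
estimate is obtained by summing the decoded eighths that precede the partial
one, which equals the sum of the prefix terms because every decoded parent is
the sum of its two decoded children.

```python
# Bits per stored code at each level: halves, quarters, eighths.
BITS = {2: 6, 4: 5, 8: 4}


def encode_4lt(F):
    """Return the 7 codes as a dict keyed by (i, j), for odd i only."""
    b = len(F)

    def delta(i, j):                       # exact sum of the i-th of j parts
        width = b // j
        return sum(F[(i - 1) * width:i * width])

    codes = {}
    for j in (2, 4, 8):
        scale = (1 << BITS[j]) - 1
        for i in range(1, j, 2):           # left children only
            parent = delta((i + 1) // 2, j // 2)
            codes[(i, j)] = round(delta(i, j) / parent * scale) if parent else 0
    return codes


def decode_4lt(codes, c):
    """Approximate sums of all tree nodes, keyed by (i, j)."""
    approx = {(1, 1): c}
    for j in (2, 4, 8):
        scale = (1 << BITS[j]) - 1
        for i in range(1, j, 2):
            parent = approx[((i + 1) // 2, j // 2)]
            approx[(i, j)] = codes[(i, j)] / scale * parent
            approx[(i + 1, j)] = parent - approx[(i, j)]   # right sibling
    return approx


def estimate_4lt(codes, c, b, d):
    """Estimate S[d] using CVA only inside the partially covered eighth."""
    approx = decode_4lt(codes, c)
    width = b // 8
    full, rest = divmod(d, width)
    estimate = sum(approx[(i, 8)] for i in range(1, full + 1))
    if rest:
        estimate += rest / width * approx[(full + 1, 8)]
    return estimate


# Hypothetical bucket of size 16 with c = 26.
F = [5, 3, 0, 0, 2, 2, 1, 1, 0, 0, 0, 4, 6, 0, 2, 0]
codes = encode_4lt(F)
estimates = [estimate_4lt(codes, sum(F), len(F), d) for d in range(1, 17)]
```

The seven codes occupy 6 + 2·5 + 4·4 = 32 bits, and each one is scaled
against its parent node rather than against the whole bucket sum.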



3.4 Worst-case Error Analysis 

The approximation error for CVA, 1b, USA and 2s arises only from
interpolation. By contrast, for the other methods (i.e., 4s, 8s, 3LT and
4LT), the scaling error due to bit saving is added to the interpolation
error. However, all methods but CVA, 1b and USA implement an equi-size
division of the bucket, and 3LT and 4LT also provide an index over
sub-buckets. We expect such a division into sub-buckets to reduce the
interpolation error, since sub-buckets increase the granularity of
summarization. In addition, we expect index-based methods (i.e., 3LT and
4LT) to reduce the scaling error, since the hierarchical tree-like
organization allows us to represent the sum inside a given sub-bucket,
corresponding to a node of the tree, as a fraction of the sum contained in
the parent node, instead of a fraction of the entire bucket sum (as happens
for the "flat" methods 4s and 8s). The worst-case analysis confirms the above
observations. In particular, we show that, while CVA, 1b and USA are
equivalent from the worst-case point of view, 4LT outperforms the other
methods.

Results of our analysis are summarized in the following theorem. Recall 
that, throughout the whole section, a bucket B of size b is given. 



Theorem 1 Let $F$ be the maximum frequency value occurring in $B$ and
assume that $b \bmod 8 = 0$. Then, the interpolation and scaling worst-case
errors of CVA, 1b, USA, 2s, 4s, 8s, 3LT and 4LT are the following:

error / method   CVA     1b      USA     2s      4s       8s       3LT      4LT
interpolation    F·b/4   F·b/4   F·b/4   F·b/8   F·b/16   F·b/32   F·b/16   F·b/32
scaling          0       0       0       0       [...]    [...]    [...]    [...]
total            F·b/4   F·b/4   F·b/4   F·b/8   [...]    [...]    [...]    [...]



Proof Let $b_M$ be the size of the smallest sub-bucket produced by the method
$M$, where $M$ is either CVA, 1b, USA, 2s, 4s, 8s, 3LT or 4LT. Observe that
$b_M = b$ for CVA, 1b and USA (since they do not produce sub-buckets),
while $b_{2s} = \frac{b}{2}$, $b_M = \frac{b}{4}$ for $M$ = 4s or $M$ = 3LT,
and $b_M = \frac{b}{8}$ otherwise.

Consider first the interpolation error (assuming that no scaling error
occurs).

Interpolation error bounds. It can be easily verified that the worst case
for a method $M$ happens whenever both the following conditions hold:

(1) there is a smallest sub-bucket, say $\bar{B}$ (of size $b_M$), containing,
in the first half, $\frac{b_M}{2}$ frequencies with value $F$ and, in the
second half, $\frac{b_M}{2}$ frequencies with value 0, and
(2) the range query involves exactly the first half of the sub-bucket
$\bar{B}$.

The proof of this part is conducted separately for each method, by
determining the maximum absolute interpolation error:

CVA: In this case, $b_M = b$, that is, the sub-bucket coincides with the
entire bucket and the query boundaries are 1 and $\frac{b}{2}$. The
cumulative value of the bucket is $F \cdot \frac{b}{2}$. Under CVA, the
estimated value of the query is $F \cdot \frac{b}{2} \cdot \frac{1}{2}$, that
is $\frac{F \cdot b}{4}$. The actual value of the query is
$\frac{F \cdot b}{2}$. Therefore the absolute error is $\frac{F \cdot b}{4}$.

1b: We obtain the same absolute error $\frac{F \cdot b}{4}$. Indeed, since
the first value of the bucket is $F$ (i.e., not null), 1-biased estimation
does not give additional information w.r.t. CVA.

USA: Also in this case, $b_M = b$, that is, the sub-bucket coincides with the
entire bucket and the query boundaries are 1 and $\frac{b}{2}$. The
cumulative value of the bucket is $F \cdot \frac{b}{2}$. USA assumes that the
$\frac{b}{2}$ non-null values are located at equal distance from each other,
and that each has the value $F$. As a consequence, the estimated value of the
query is $F \cdot \frac{b}{4}$, since the query involves just half of the
non-null estimated values. The actual value is $\frac{F \cdot b}{2}$. Thus,
the absolute error is $\frac{F \cdot b}{4}$, that is, the same as CVA.

2s: In this case $b_M = \frac{b}{2}$. As in the CVA case, the absolute error
is $\frac{F \cdot b_M}{4}$, that is $\frac{F \cdot b}{8}$.

4s and 3LT: Both 4s and 3LT produce sub-buckets of size $\frac{b}{4}$. Thus,
in these cases $b_M = \frac{b}{4}$. As in the previous case, the absolute
error is $\frac{F \cdot b_M}{4}$, that is $\frac{F \cdot b}{16}$.

8s and 4LT: Both 8s and 4LT produce sub-buckets of size $\frac{b}{8}$. Thus,
in these cases $b_M = \frac{b}{8}$. As in the previous case, the absolute
error is $\frac{F \cdot b_M}{4}$, that is $\frac{F \cdot b}{32}$.
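
The worst-case configuration used in this part of the proof can be checked
numerically with a short sketch. The function name and the sample values of F
and b are assumptions chosen for illustration; the sub-bucket size is assumed
to be even.

```python
def worst_case_interpolation_error(f_max, b_m):
    """Absolute CVA error on the worst-case sub-bucket of size b_m.

    The first half of the sub-bucket holds frequencies equal to f_max, the
    second half holds zeros, and the query covers exactly the first half.
    """
    half = b_m // 2
    sub_bucket = [f_max] * half + [0] * (b_m - half)
    actual = sum(sub_bucket[:half])
    estimate = half / b_m * sum(sub_bucket)   # linear interpolation
    return abs(actual - estimate)


# Hypothetical values: maximum frequency F = 10, bucket size b = 64.
F_MAX, B = 10, 64
errors = {
    "CVA, 1b, USA": worst_case_interpolation_error(F_MAX, B),       # b_M = b
    "2s": worst_case_interpolation_error(F_MAX, B // 2),            # b_M = b/2
    "4s, 3LT": worst_case_interpolation_error(F_MAX, B // 4),       # b_M = b/4
    "8s, 4LT": worst_case_interpolation_error(F_MAX, B // 8),       # b_M = b/8
}
```

Each value equals F·b_M/4, so halving the size of the smallest sub-bucket
halves the worst-case interpolation error.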

Now we consider the scaling error.
Scaling error bounds. The proof that CVA, 1b, USA and 2s do not
produce scaling error is straightforward. Let us consider the other methods:
4s: Since each sub-bucket sum is encoded by 8 bits and is scaled w.r.t. the 
overall bucket sum, the maximum scaling error is . 
8s: Since each sub-bucket sum is encoded by 4 bits and scaled w.r.t. the 
overall bucket sum, the maximum scaling error is ^ = ^ . 
3LT: In this case, the scaling error may be propagated going down along the 
path from the root to the leaves of the tree. We may determine an upper 
bound of the worst-case error by considering the sum of the maximum 
scaling error at each level. Thus, we obtain the following upper bound: 

F-b I F b 

^ oil ^ ■ Indeed, the maximum scaling error of the first level is |t^. The 
above value is obtained by considering that the maximum sum in the half 
bucket corresponding to the first level is and that going down to the 
second level introduces a maximum scaling error obtained by dividing the 
overall sum by 2^^. Thus, the maximum scaling error for 3LT is 0(|t^) 
(that is, the scaling error of the first level). 
4LT: The same argument as for 3LT applies, showing that the maximum
scaling error is of the same order as that of the first level, which uses
6 bits.
The proof is thus completed. 

It is worth noting that, as expected, 4LT and 8s produce the smallest
interpolation worst-case error, that is $\frac{F \cdot b}{32}$. Considering
also the results about the scaling error, the overall conclusion we may draw
from the above analysis is that the two best methods w.r.t. interpolation,
that is 8s and 4LT, are not equivalent in terms of scaling error. Indeed, 4LT
shows a relevant accuracy improvement over 8s in this respect.

In the next subsection we shall perform a number of experiments to 
provide additional arguments in favor of the superiority of 4LT estimation, 
by performing also an average-case analysis of methods under a number 
of meaningful data distributions. We shall not conduct experiments on the 
CVA because we are aware that CVA uses 32 bits less and, therefore, could 
reduce the size of the bucket, thus providing a better accuracy. Actually, 
the performance analysis coincides with the one of 2s estimation, that is 
CVA in half bucket. 

3.5 Experiments inside a Bucket 

In this section we report the results of a large number of experiments per- 
formed with various synthetic data sets obtained with different distribu- 
tions. We measure the accuracy of all the above mentioned methods in 
estimating range queries inside a bucket. In particular, the methods
considered are: USA, 1b, 2s, 8s, 3LT and 4LT. We observe that the space
required for storing a bucket is the same for all the considered methods.
Experiments are conducted on synthetic data generated according to several
data distributions. A data distribution is characterized by a distribution for frequencies
and a distribution for spreads. Frequency set and value set are generated 
independently, then frequencies are randomly assigned to the elements of 
the value set. 

3.5.1 Test Bed. In this section we illustrate the test bed used in our
experiments. In particular, we describe (1) the data distributions, that is,
the probability distributions used for generating frequencies in the tested
buckets, (2) the bucket populations, that is, the sets of parameters
characterizing the buckets generated under those probability distributions,
(3) the data sets, that is, the sets of samples produced by combining (1) and
(2), and (4) the query set and error metrics, that is, the set of queries
submitted to the sample data and the metrics used for measuring the
approximation error.

Data Distributions: We consider four data distributions: (1)
Zipf-cusp_max(0.5,1.0): Frequencies are distributed according to a Zipf
distribution with the z parameter equal to 0.5. Spreads are distributed
according to a Zipf cusp_max (i.e., increasing spreads following a Zipf
distribution for the first half of the elements and decreasing spreads
following a Zipf distribution for the remaining elements) with z parameter
equal to 1.0. (2) Zipf-cusp_max(1.0,1.0). (3) Zipf-cusp_max(1.5,1.0).
(4) Gauss-rand: Frequencies are distributed according to a Gauss distribution
with standard deviation 1.0. Spreads are randomly distributed as well.

Bucket Populations: A population is characterized by the values of c 
(overall cumulative frequency), b (the bucket size) and t (number of non- 
null attribute values) and consists of all buckets having such values. We 
consider 9 different populations divided into two sets, that are called t-var 

and b-var, respectively. 

Set of populations t-var. It is a set of 6 populations of buckets, all of them
with c = 20000 and b = 500. The 6 populations differ in the value of the
parameter t (t = 10, 100, 200, 300, 400, 500), and are denoted by t-var(10),
t-var(100), t-var(200), t-var(300), t-var(400) and t-var(500), respectively.
Set of populations b-var. It is a set of 4 populations of buckets, all of them
with c = 20000. They differ in the values of the parameters b and t. We
consider 4 different values for b (b = 100, 200, 500, 1000). The number of
non-null values t of each population is fixed in such a way that the ratio t/b
is constant and equal to 0.2; so the values of t are 20, 40, 100 and 200.
The four populations are denoted by b-var(100), b-var(200), b-var(500) and
b-var(1000).

Moreover, a generic population with given values of the parameters c, b and
t is denoted by p(c, b, t).
Data Sets: By a data set we mean a sampling of the set of buckets
belonging to a given population and following a given data distribution. Each
data set included in the experiments is obtained by generating 100 buckets
belonging to one of the populations specified above under one of the data
distributions described above. We denote a data set by the name of the data
distribution and the name of the population. For example, the data set
(Zipf-cusp_max(0.5,1.0), b-var(200)) denotes a sampling of the set of
buckets belonging to the population of b-var corresponding to the value 200 for
the parameter b and following the data distribution Zipf-cusp_max(0.5,1.0).

We generate 23 different data sets classified as follows: (1) Zipf-t (i.e.,
Zipf data, different bucket density), containing the six data sets
(Zipf-cusp_max(0.5,1.0), t-var(t)), for t = 10, 100, 200, 300, 400, 500. (2) Zipf-b
(i.e., Zipf data, different bucket size), containing the four data sets
(Zipf-cusp_max(0.5,1.0), b-var(b)), for b = 100, 200, 500, 1000. (3) Gauss-t (i.e.,
Gauss data, different bucket density), containing the six data sets
(Gauss-rand, t-var(t)), for t = 10, 100, 200, 300, 400, 500. (4) Gauss-b (i.e., Gauss
data, different bucket size), containing the four data sets (Gauss-rand,
b-var(b)), for b = 100, 200, 500, 1000. (5) Zipf-z (i.e., Zipf data, different skew),
containing the three data sets (Zipf-cusp_max(z,1.0), p(20000,400,200)), for
z = 0.5, 1.0, 1.5. Recall that p(20000,400,200) denotes the population
characterized by c = 20000, b = 400, t = 200.

Each class of data sets is designed for studying the dependence of the
accuracy of the various methods on a different parameter (parameter t,
measuring the density of the bucket, parameter b, measuring the size of the
bucket, and parameter z, measuring the data skew). For each data set, 1000
different samples obtained by permutation of frequencies were generated and
tested, in order to give statistical significance to the experiments.
Query set and error metrics: We perform all the queries S[d], for all
$1 \le d < b$. We measure the error of approximation made by the various
estimation techniques on the above query set by using both:

- the average of the relative error, $\frac{1}{b-1}\sum_{d=1}^{b-1} e_d^{rel}$, where $e_d^{rel}$ is the
relative error of the query with range d, i.e., $e_d^{rel} = \frac{|S[d]-\tilde{S}[d]|}{S[d]}$, and

- the normalized absolute error, that is, the ratio between the average
absolute error and the overall sum of the frequencies in the bucket, i.e.,

$$\sum_{d=1}^{b-1} \frac{|S[d]-\tilde{S}[d]|}{c\,b}$$

where $\tilde{S}[d]$ is the value of S[d] estimated by the technique at hand.
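
The two metrics can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiments: it assumes that S[d] is the sum of the first d frequencies of the bucket, that the CVA estimate over the whole bucket is the linear interpolation c·d/b, that queries with S[d] = 0 are skipped in the relative error, and that the absolute error is normalized by c·b. All function names are hypothetical.

```python
def prefix_sums(freqs):
    """S[d] for d = 1..b-1: sum of the first d frequencies of the bucket."""
    sums, running = [], 0
    for f in freqs[:-1]:
        running += f
        sums.append(running)
    return sums


def cva_estimates(freqs):
    """Linear-interpolation (CVA) estimate of S[d] using only b and c."""
    b, c = len(freqs), sum(freqs)
    return [c * d / b for d in range(1, b)]


def average_relative_error(actual, estimated):
    """Mean of |S - S~| / S; queries with S = 0 are skipped (assumption)."""
    errs = [abs(s - e) / s for s, e in zip(actual, estimated) if s != 0]
    return sum(errs) / len(errs) if errs else 0.0


def normalized_absolute_error(actual, estimated, c, b):
    """Sum of absolute errors divided by c * b."""
    return sum(abs(s - e) for s, e in zip(actual, estimated)) / (c * b)
```

For a bucket whose whole sum sits on the first value, e.g. frequencies [4, 0, 0, 0], the exact answers are 4, 4, 4 while CVA returns 1, 2, 3, which is the kind of sparse case where linear interpolation performs poorly.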

3.5.2 Results of Experiments and Discussion. In this section we give a
qualitative discussion of the approximation error of the considered methods,
excluding USA and 1-biased, for which we have already provided a
theoretical analysis in Section 3.4. First we consider the methods that work
simply by splitting the original bucket, that is, 2s, 4s and 8s. For all these
methods, the estimation error may arise from the following approximation
sources:

1. the linear interpolation (i.e., CVA), concerning the evaluation of the
query inside the "smallest" sub-buckets (for instance, in the case of
4s, the smallest sub-buckets are the quarters of the bucket),

2. the numeric approximation, in case sums are stored using fewer than 32 bits
(note that only 2s is not affected by this error).

We refer to these two components of the approximation error as error of
type 1 and error of type 2, respectively.

Relative error vs data density. Concerning error of type 1, what we expect
is that, for all methods, it increases as data sparsity increases. Indeed, in
the case of sparse data, the sum tends to concentrate in a few points, and this
reduces the suitability of linear interpolation for approximating the frequency
distribution. Moreover, we expect that this component of the error
decreases as the splitting degree increases: for instance, in the case of 8s, which
splits the bucket into 8 parts, we expect more accuracy (in terms of the error of
type 1) than with the 2s method. The reason is that having smaller sub-buckets
means applying linear interpolation to shorter (and, thus, better
linearly-approximable) segments of the cumulative frequency distribution.

About error of type 2 we expect both that (i) it increases as the splitting
degree increases and that (ii) it is independent of data sparsity. Claim (i) is
explained by considering that increasing the splitting degree means reducing
the number of bits used for representing the sums of the sub-buckets. Claim (ii)
is related to the numeric nature of the error.

The observations above show the existence of a trade-off between the
need to increase the splitting degree for improving CVA precision on the one
hand, and the need to use as many bits as possible for representing partial
sums in the bucket on the other hand. However, we expect that such a
trade-off is more evident in the case of a high splitting degree, that is, when the
error of type 2 is more relevant. For instance, recalling that the maximum
absolute error of type 2 is $\frac{c}{2^{k+1}}$, where k is the number of bits assigned to the
smallest sub-buckets, with k = 4 for 8s and k = 8 for 4s, the maximum
absolute error of type 2 for 8s in the case c = 20000 is 625 (i.e., about 3%
of c) while it is 39 (i.e., a negligible percentage of c) for 4s.
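
The arithmetic behind these two figures can be checked with a short Python sketch. The function name is hypothetical; the bound c / 2^(k+1) is the one that reproduces the values 625 and 39 quoted in the text.

```python
def max_type2_error(c, k):
    """Worst-case rounding error when a sum in [0, c] is stored on k bits."""
    return c / 2 ** (k + 1)


c = 20000
for name, k in (("8s", 4), ("4s", 8)):
    err = max_type2_error(c, k)
    print(f"{name}: k = {k}, max error = {err:.1f}, share of c = {100 * err / c:.2f}%")
```

Halving the number of bits per sub-bucket from 8 to 4 multiplies the bound by 2^4 = 16, which is why the error of type 2 is negligible for 4s but not for 8s.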

Experiments confirm the above considerations. By looking at the graphs of
Figure 3(a) we may observe that for 2s and 4s the error decreases as the data
density increases. On the contrary, for 8s, the error is quasi-constant (slightly
increasing) in the case of Zipf distributions, while it is slightly decreasing (but
much less quickly than for 4s) in the case of the Gauss distribution (see Figure 4(a)).
Concerning the comparison between 2s, 4s and 8s, we may observe in Figure
4(a) that for low values of data density, as expected, the accuracy of 8s is higher
than that of 4s and, in turn, the accuracy of 4s is higher than that of 2s. But, as
observed above, for increasing data density, the trends of 4s and 8s suffer, to a
different extent, from the presence of the error of type 2. This appears quite evident
in Figure 3(a), where we may note that 8s becomes worse than 4s from
about 210 non-null elements on and that the improving trend of 2s is considerably
faster than that of the other methods (since 2s does not suffer from the error of type 2).

We observe that USA gives better estimations than 1b on Zipf data (see
Figure 3(a)). The accuracy of USA becomes the worst when the data sets
follow the Gauss distribution (see Figure 4(a)). This shows that the
assumption made by USA is applicable only to particular distributions of
frequencies and spreads, like those of the data sets Zipf-t. The results obtained
on data sets distributed according to a Gauss distribution confirm this
claim: USA is the least accurate method when the data sets have a
random distribution, as happens for Gauss-t.

Concerning 1b we may observe that the behaviours of 1b and 2s are
similar. As expected, exploiting the information that the bucket
is 1-biased does not give a significant contribution to the accuracy of the
estimation. Indeed, the knowledge of the position of just one element in the
bucket does not, in general, add appreciable information.

Consider now the usage of the tree indices 3LT and 4LT. Recall that 3LT
has the same splitting degree as 4s, since both methods divide the bucket
into 4 sub-buckets. Possible differences in accuracy between the two
methods may arise from the error of type 2. Indeed, the tree-like organization
of the indices allows us to represent the sum inside a given sub-bucket
corresponding to a node of the tree as a fraction of the sum contained in the
parent node, instead of a fraction of the entire sum (as happens for the "flat"
methods). Thus, we expect that tree indices produce smaller errors of type 2.
Fig. 3 Experimental Results for data sets Zipf. (a): Error for different
values of t. (b): Error for different values of b.



However, as previously noted, 4s produces a negligible percentage of error of
type 2. This explains why 3LT and 4s basically present the same error (the lines
in the graphs overlap almost entirely).
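
The difference between storing a sub-bucket sum relative to the whole bucket ("flat") and relative to its parent node ("tree") can be illustrated with the following Python sketch. It is not the paper's bit layout: the quantization rule (rounding to one of 2^bits - 1 steps), the bit widths and the example sums are all assumptions chosen only to show the effect.

```python
def encode(value, maximum, bits):
    """Quantize value in [0, maximum] to an integer code on the given bits."""
    levels = (1 << bits) - 1
    return round(value / maximum * levels) if maximum else 0


def decode(code, maximum, bits):
    """Map an integer code back to an approximate value in [0, maximum]."""
    levels = (1 << bits) - 1
    return code * maximum / levels


def flat_estimate(child, c, bits):
    """Child sum stored as a fraction of the whole bucket sum c."""
    return decode(encode(child, c, bits), c, bits)


def tree_estimate(child, parent, c, parent_bits, child_bits):
    """Parent stored as a fraction of c, child stored as a fraction of the parent."""
    parent_approx = decode(encode(parent, c, parent_bits), c, parent_bits)
    child_code = encode(child, parent, child_bits)
    return decode(child_code, parent_approx, child_bits)


c, parent, child = 20000, 5000, 300          # hypothetical sums
flat = flat_estimate(child, c, bits=4)
tree = tree_estimate(child, parent, c, parent_bits=6, child_bits=4)
print(f"true = {child}, flat = {flat:.1f}, tree = {tree:.1f}")
```

With the same 4 bits for the child, the flat code cannot distinguish 300 from 0 because its step is a fraction of c, while the tree code uses a step that is a fraction of the much smaller parent sum.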

4LT has the same splitting degree as 8s (since both methods divide the
bucket into 8 sub-buckets). As a consequence, since the error of type 2
of 8s is appreciable (as already discussed), we may expect improvements from
the usage of 4LT. This is what results from the experiments. 4LT has the best
performance: it shows only the benefits deriving from the increase of data
density (which reduces the error of type 1), with no appreciable
increase of the error of type 2. 4LT, thanks to the tree-like organization of
the sums, seems to solve the trade-off between increasing the splitting degree
(for improving CVA precision) and controlling the numeric error arising from
the usage of a reduced number of bits for representing sums.



Relative error vs bucket size and data skew. First consider the populations
b-var. Recall that for such data sets we have kept the data
density constant at around 20%. Thus, increasing the bucket size also means
increasing the number of non-null elements. While, as for the previous
experiments, the error of type 2 is independent of the bucket size (and all the
above considerations about the relationship between error of type 2, splitting
degree and number of bits per smallest sub-bucket are still valid), we expect
that CVA precision is affected by the variation of the bucket size. Indeed, on
the one hand the CVA precision decreases as the bucket size increases, since, for
a larger bucket, linear interpolation is applied to a larger segment of the
cumulative frequency. But, on the other hand, increasing the bucket size means
increasing the number of non-null elements (keeping the overall sum constant) and
this means reducing the probability that the sum is concentrated in a few
peaks. Thus, whenever the cumulative frequency is smooth, linear
interpolation tends to give better results. Depending on the data distribution, we
may observe either that the two opposite components compensate each other or
that one prevails over the other. Indeed, experiments with Zipf data,
corresponding to Figure 3(b), show that the methods have a quasi-constant trend (with
a slight prevalence of the first component), while experiments conducted
on Gauss data, corresponding to Figure 4(b), show a net prevalence of the
second component (all the methods present a decreasing trend for
increasing bucket size). Such experiments do not give new information about the
comparison between the considered methods, and substantially confirm the
previous results. Again 4LT has the best performance.

Results of the experiments conducted on the class of data sets Zipf-z, for
measuring the dependence of the accuracy of the methods on the data skew, are
reported in Figure 5. We note that all methods become worse as z increases
(as can be intuitively expected). The behaviours of 1b and 2s are similar,
while 4LT shows the best performance.

As a final remark we may summarize the comparison between the
considered methods by concluding that the worst method is always 2s; for sparse
data it is followed by 3LT and 4s and then by 8s, whereas for dense data
3LT and 4s show better performance than 8s. Observe that 4s and 3LT have
basically the same accuracy. The best method is clearly 4LT.
Fig. 4 Experimental Results for data sets Gauss. (a): Data sets Gauss-t:
error for different values of t. (b): Data sets Gauss-b: error for different
values of b.



4 Applying the 4LT Index to the Entire Histogram 

The analysis described in the previous sections suggests applying the
technique of the 4-level tree index to a whole histogram in order to improve its
accuracy in approximating the underlying frequency set. We stress
that the problem of investigating whether such an addition is really
convenient is not straightforward: observe that 4LT buckets use 32 bits more
Fig. 5 Data sets Zipf-z: dependence on data skew



than CVA ones and thus, for a fixed storage space, allow a smaller
number of buckets. In this section we show how to combine the 4LT technique
with classical methods for constructing histograms and we perform a large
number of experiments to measure the effective improvement given by the
usage of 4LT. The advantage of the 4LT index is shown to be relevant
also when it is compared with buckets using CVA, that is, when the
storage space required by 4LT is larger than that of the original method. Moreover,
the 4LT index shows very good performance if it is combined with a very
simple method for constructing histograms, called EquiSplit, consisting of
partitioning the attribute domain into equal-size buckets. Let us start with
a quick overview of the most relevant methods proposed so far for the
construction of histograms.



4.1 Methods for Constructing Histograms

Besides the method used for approximating frequencies inside buckets, the
capability of a histogram to accurately approximate the underlying
frequency set strongly depends on the way such a set is partitioned into
buckets. Typically, the criterion driving the construction of a histogram is the
minimization of the error in the reconstruction of the original (cumulative)
frequency set from the histogram. The partition rules proposed in [18,14] try
to achieve this goal. Among those, we sketch the description of two
well-known approaches: MaxDiff and V-Optimal (see [18,16] for an exhaustive
taxonomy). Note that these methods are defined for 2-histograms but are
in practice mainly used for 1-histograms to minimize storage consumption.
MaxDiff. A MaxDiff histogram [18] of size h is obtained by putting a
boundary between two adjacent attribute values $v_i$ and $v_{i+1}$ of V if the
difference between $f(v_{i+1}) \cdot \sigma_{i+1}$ and $f(v_i) \cdot \sigma_i$ is one of the $h - 1$ largest
such differences (where $\sigma_i$ denotes the spread of $v_i$). The product $f(v_i) \cdot \sigma_i$
is called the area of $v_i$.
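
The MaxDiff rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the spread of a value is taken to be the distance to the next value, the last value is given spread 1, the absolute value of the difference between adjacent areas is used, and ties are broken by position. The function names are hypothetical.

```python
def maxdiff_boundaries(values, freqs, h):
    """Indices i such that a boundary falls between values[i] and values[i + 1]."""
    n = len(values)
    # Spread of v_i: distance to the next value; last value gets spread 1 (assumption).
    spreads = [values[i + 1] - values[i] for i in range(n - 1)] + [1]
    areas = [f * s for f, s in zip(freqs, spreads)]
    diffs = [(abs(areas[i + 1] - areas[i]), i) for i in range(n - 1)]
    # Keep the h - 1 largest differences between adjacent areas.
    largest = sorted(diffs, key=lambda d: (-d[0], d[1]))[:h - 1]
    return sorted(i for _, i in largest)


def maxdiff_buckets(values, freqs, h):
    """Return (lowest value, highest value, sum of frequencies) for each bucket."""
    cuts = maxdiff_boundaries(values, freqs, h)
    buckets, start = [], 0
    for cut in cuts + [len(values) - 1]:
        buckets.append((values[start], values[cut], sum(freqs[start:cut + 1])))
        start = cut + 1
    return buckets


print(maxdiff_buckets([1, 2, 3, 4, 5, 6], [10, 10, 10, 100, 100, 100], 2))
```

In this toy input the only large jump in area is between the third and fourth values, so the single boundary of a 2-bucket histogram is placed there.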

V-Optimal. A V-Optimal histogram [18,14] gives very good performance.
It is obtained by selecting the boundaries of each bucket, $inf_i$ and $sup_i$,
$1 \le i \le n$, so that $\sum_{i=1}^{n} SSE_i$ is minimal, where $SSE_i = \sum_{j=inf_i}^{sup_i} (f(j) - avg_i)^2$
and $avg_i$ is equal to the average frequency in the i-th bucket, that is, the
cumulative frequency in the whole bucket divided by the size $sup_i - inf_i + 1$.
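
One common way to find boundaries minimizing the total SSE is dynamic programming over prefixes of the domain. The Python sketch below illustrates that general technique; it is not claimed to be the construction used in the cited works, and the function name and the 0-based (inf, sup) index pairs it returns are assumptions of the example. It requires at least as many domain values as buckets.

```python
def v_optimal(freqs, n_buckets):
    """Return (minimal total SSE, list of (inf, sup) index pairs), 0-based."""
    m = len(freqs)
    # Prefix sums of f and f^2 give the SSE of any interval in constant time.
    ps, ps2 = [0.0], [0.0]
    for f in freqs:
        ps.append(ps[-1] + f)
        ps2.append(ps2[-1] + f * f)

    def sse(lo, hi):
        """SSE of freqs[lo:hi] (hi exclusive) around the interval average."""
        s = ps[hi] - ps[lo]
        return (ps2[hi] - ps2[lo]) - s * s / (hi - lo)

    inf = float("inf")
    # best[k][hi]: minimal SSE when the first hi values are split into k buckets.
    best = [[inf] * (m + 1) for _ in range(n_buckets + 1)]
    cut = [[0] * (m + 1) for _ in range(n_buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, n_buckets + 1):
        for hi in range(k, m + 1):
            for lo in range(k - 1, hi):
                cost = best[k - 1][lo] + sse(lo, hi)
                if cost < best[k][hi]:
                    best[k][hi], cut[k][hi] = cost, lo
    # Walk the stored cuts backwards to recover the bucket boundaries.
    bounds, hi = [], m
    for k in range(n_buckets, 0, -1):
        lo = cut[k][hi]
        bounds.append((lo, hi - 1))
        hi = lo
    return best[n_buckets][m], bounds[::-1]
```

The triple loop costs on the order of n·m² interval evaluations for m domain values and n buckets, which is acceptable for the domain sizes considered here but is the reason cheaper heuristics such as MaxDiff are attractive.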

We now propose to combine both methods, MaxDiff and V-Optimal, with
the 4LT index in order to have an approximate representation of the frequency
distributions inside the buckets. We shall compare the revised methods
with the original ones using CVA estimation, at parity of storage
consumption. The results will show that the 4LT index substantially increases the
estimation accuracy of both methods. The additional estimation power carried
by the 4LT index even enables a very simple method like the one described
below to produce very accurate estimations.

EquiSplit. The attribute domain is split into k buckets of approximately
the same size $b = \lceil m/k \rceil$. In this way, as the boundaries of all buckets
can be easily determined from the value b, we only need to store one value
for each bucket: the sum of all frequencies. This method has been first 
introduced in fS] and, as the experimental analysis will confirm, it has very 
good performances for low skewed data, while its performances get worse 
in case of high skew. 
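
A minimal Python sketch of EquiSplit with CVA estimation is given below. It assumes that m is the size of the attribute domain, that the frequencies are given as a list indexed by domain position, and that a prefix query is answered by linear interpolation inside the partially covered bucket. The function names are hypothetical, and the last bucket is treated as if it had full size b, which is a simplification.

```python
import math


def equisplit(freqs, k):
    """Split the domain into buckets of size ceil(m / k); keep only each sum."""
    m = len(freqs)
    b = math.ceil(m / k)
    return b, [sum(freqs[i:i + b]) for i in range(0, m, b)]


def cva_prefix_estimate(sums, b, d):
    """Estimate the sum of the first d frequencies from the bucket sums."""
    full, rest = divmod(d, b)
    est = sum(sums[:full])                  # buckets entirely inside the range
    if rest and full < len(sums):
        est += sums[full] * rest / b        # linear interpolation in the last one
    return est


b, sums = equisplit([1] * 10, 5)
print(b, sums, cva_prefix_estimate(sums, b, 3))
```

Because the boundaries follow from b alone, only the list of sums has to be stored, which is what lets EquiSplit afford more buckets than MaxDiff or V-Optimal in the same space.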

4.2 Experiments on Histograms 

In this section we shall conduct several experiments both on synthetic and 
real-life data in order to compare the effectiveness of several histograms in 
estimating range query size. 

Experiments on Synthetic Data. First we present the experiments performed 
on synthetic data. Below we describe data sets, error metrics and the query 
set considered in our experiments. 

Available Storage: Note that under CVA each bucket stores only two
integers, while with the 4LT index each bucket needs three integers. Assuming
that an integer takes 32 bits, and given a fixed number K of bits
for the total storage space of the whole histogram, both MaxDiff
and V-Optimal under CVA produce $\lfloor K/64 \rfloor$ buckets, while both of them with
4LT indices produce only $\lfloor K/96 \rfloor$ buckets. On the other hand, a bucket for
EquiSplit needs just one integer (the sum of all the frequencies), while for
EquiSplit-4LT it needs two integers. Thus, for a fixed number K of bits for
the total storage space, EquiSplit with CVA produces $\lfloor K/32 \rfloor$ buckets and
EquiSplit with 4LT indices produces $\lfloor K/64 \rfloor$ buckets, as MaxDiff with CVA does.
For our experiments, we shall use a storage space of 42 four-byte
numbers, to be in line with the experiments reported in [18,14], which we
replicate. Using the above considerations, it can be easily seen that MaxDiff
with CVA, V-Optimal with CVA, and EquiSplit with 4LT indices produce
21 buckets, EquiSplit with CVA produces 42 buckets, and both MaxDiff
and V-Optimal with 4LT indices produce only 14 buckets.
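
The bucket counts follow directly from the number of 32-bit integers each method stores per bucket, as the short Python check below shows. The names INT_BITS, buckets and layout are illustrative only.

```python
INT_BITS = 32


def buckets(total_bits, ints_per_bucket):
    """Number of whole buckets that fit in the given storage space."""
    return total_bits // (INT_BITS * ints_per_bucket)


K = 42 * INT_BITS  # 42 four-byte numbers
layout = {
    "MaxDiff / V-Optimal with CVA": 2,
    "MaxDiff / V-Optimal with 4LT": 3,
    "EquiSplit with CVA": 1,
    "EquiSplit with 4LT": 2,
}
for name, ints in layout.items():
    print(f"{name}: {buckets(K, ints)} buckets")
```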

Data Distributions: A data distribution is characterized by a
distribution for frequencies and a distribution for spreads. Frequency set and value
set are generated independently, then frequencies are randomly assigned to
the elements of the value set. We consider 5 data distributions: (1) D1 =
Zipf-cusp_max(0.5,1.0). (2) D2 = Zipf-zrand(0.5,1.0): Frequencies are
distributed according to a Zipf distribution with the z parameter equal to 0.5.
Spreads follow a ZRand distribution [16] with z parameter equal to 1.0 (i.e.,
spreads following a Zipf distribution with z parameter equal to 1.0 are
randomly assigned to attribute values). (3) D3 = Gauss-rand: Frequencies are
distributed according to a Gauss distribution with standard deviation 1.0.
Spreads are randomly distributed. (4) D4 = Zipf-cusp_max(1.5,1.0). (5) D5
= Zipf-cusp_max(3.0,1.0).

Histogram Populations: A population is characterized by the values of
three parameters, T, D and t, and represents the set of histograms
storing a relation of cardinality T, attribute domain size D and value set
size t (i.e., number of non-null attribute values).

Population P1. This population is characterized by the following values for
the parameters: D = 4100, t = 500 and T = 100000.

Population P2. This population is characterized by the following values for
the parameters: D = 4100, t = 500 and T = 500000.

Population P3. This population is characterized by the following values for
the parameters: D = 4100, t = 1000 and T = 500000.

Data Sets: Similarly to the experiments inside buckets, each data set
included in the experiments is obtained by generating, under one of the data
distributions described above, 10 histograms belonging to one of the
populations specified above. We consider the 15 data sets that are generated by
combining all data distributions and all populations.

All queries belonging to the query set below are evaluated over the his- 
tograms of each data set: 

Query set and error metrics: In our experiments, we use the query
set $\{X \le d : d \in U\}$ (recall that X is the histogram attribute and U
is its domain) for evaluating the effectiveness of the various methods. We
measure the error of approximation made by histograms on the above query
set by using the average of the relative error, $\frac{1}{Q}\sum_{i=1}^{Q} e_i^{rel}$, where Q is the
cardinality of the query set and $e_i^{rel}$ is the relative error, i.e.,
$e_i^{rel} = \frac{|S_i - \tilde{S}_i|}{S_i}$,
where $S_i$ and $\tilde{S}_i$ are the actual answer and the estimated answer of the
i-th query of the query set.
Table 1 Pop. 1: error for various methods.
Table 2 Pop. 2: error for various methods.
Table 3 Pop. 3: error for various methods.

4.2.1 Results of the Experiments. In Tables 1, 2 and 3 the results of the
experiments conducted on all data sets are reported. We denote the methods
MaxDiff, V-Optimal and EquiSplit with CVA by MD, VO and ES,
respectively; these methods with 4LT indices are denoted by MD_4LT, VO_4LT and
ES_4LT.

The comparative behavior of the various methods is similar for the three
populations. Experiments confirm the good performance of the MaxDiff method
and, particularly, of V-Optimal, but they also show that 4LT adds
relevant benefits to both methods. Indeed MD_4LT and VO_4LT show very
low errors. EquiSplit and EquiSplit-4LT also perform well. But,
as shown in Figure 6(a), where the dependence of the estimation error on
data skew is plotted, these methods quickly get worse for high data skew.
Indeed, in such cases, the benefit given by the higher number of buckets
is lost because of the high skew inside buckets. In the case of high skew,
partition rules play a central role, and the naive approach of EquiSplit is not
method    data set A    data set B
ES              4.32          7.02
ES_4LT          0.97          3.59
MD             11.30         22.82
MD_4LT          1.63          1.25
VO              4.49         17.19
VO_4LT          1.86          3.05

Table 4 Errors obtained on real data.



suitable. Interestingly, we observe that the improvement of MaxDiff and
V-Optimal obtained by using 4LT indices is relevant also for high skew, proving
the effectiveness of such indices. In Figure 6(b) we show the dependence
of the accuracy of the methods on the amount of space. There, we consider
the data distribution D4 and the population P1 and generate 10 histograms
belonging to P1 according to D4 for different amounts of space. The aim
of this experiment is to study the behaviour of the various methods as the
compression factor increases. Clearly, when the available amount of space
increases, all methods behave well. The differences are more relevant for
values corresponding to high compression. Methods using 4LT are the best.
This can be intuitively explained by considering that in the case of large
buckets the role of the approximation technique inside buckets becomes more
important than the rules followed for constructing buckets.

Experiments on Real-Life Data. We have performed further experiments
using real-life data. We have considered two data sets (that we denote by
Data Set A and Data Set B) obtained from the 1997 U.S. Census Statistics
[21], by choosing two attributes of the table Special District Governments,
having the following characteristics:

Data Set A: attribute name: Type Code, domain size: D = 998, number
of non-null attribute values: t = 787, cardinality: T = 34683.
Data Set B: attribute name: Function Code, domain size: D = 99, number
of non-null attribute values: t = 32, cardinality: T = 34683.

We use for each histogram the same amount of storage space, that is 
21 four-byte numbers. Query set and error metrics are the same used for 
experiments on synthetic data. 

Results of the Experiments. As shown in Table 4, experiments on real
data confirm the results obtained with synthetic data. We note that 4LT
adds relevant benefits to MaxDiff and V-Optimal and that both EquiSplit and
EquiSplit-4LT perform well. Not surprisingly, for data set
A, EquiSplit-4LT produces the smallest error. This can be explained by
considering that the data of this set are rather uniform and, in this case, as
discussed previously, the cheapest technique (in terms of storage space) gives
the best performance. In other words, the extra storage space required for
recording the bucket boundaries of the more sophisticated techniques does not
give benefits, due to the trivial data distribution.
Fig. 6 Experimental Results. (a): Dependence of the accuracy on the data
skew. (b): Dependence of the accuracy on the representation size (i.e.,
number of stored 4-byte integers).

5 Conclusions 

In this paper we have presented a technique for improving the frequency
estimation within each bucket of a histogram. This technique goes beyond
the simple methods used in the literature, that is, the continuous value
assumption and the uniform spread assumption. Our method is based on the
addition of a 32-bit data item to each bucket, organized into a 4-level tree index
(4LT, for short) that stores, in a bit-saving approximate form, a number of
hierarchical range queries internal to the bucket. We have shown both
theoretically and experimentally that such additional information effectively
allows us to better estimate range queries inside buckets. Interestingly, the
usage of 4LT on top of histograms built through well-known techniques like
MaxDiff and V-Optimal outperforms such histograms in terms of accuracy.
This claim is supported in the paper by a large number of experiments
conducted on both synthetic and real-life data, where classical histograms
combined with 4LT are compared with the standard versions (i.e., with no
4LT) under several different data distributions at parity of consumed
storage space. It turns out that the price we have to pay in terms of storage
space, by consuming 32 bits more per bucket w.r.t. CVA-based histograms, is
outweighed by the benefits given by the improvement of precision in
estimating queries inside buckets. Thus, the main conclusion we draw is that the
4LT index may represent a general technique that can be combined with
any bucket-based histogram for significantly improving its accuracy.